Hybrid model for DNA probe design and validation using nonlinear and linear regression methods

ABSTRACT

Methods and systems for selecting oligonucleotide probes for use in microarray applications are provided herein. The described methods use a combination of measured probe performance and predicted probe performance to select probes. Nucleic acid arrays containing probes selected by the described methods are described. Also included are algorithms for performing the subject methods recorded on computer-readable media and computational systems for analysis.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of, and claims priority to, U.S. patent application Ser. No. 10/996,323, filed Nov. 23, 2004.

BACKGROUND

Comparative genomic hybridization (CGH) and location analysis are important applications that allow scientists to make biological measurements involving genomics and cytogenetics, and to study the expression and regulation of genes in biological systems. Both CGH and location analysis entail quantifying or measuring changes in copy number of genomic sequences in biological or medical samples. CGH is particularly important in developmental biology and in the study of the causes of cancer, and offers great potential in the diagnostics of cancer and developmental diseases. Recently, cDNA microarrays have been used for CGH studies. An oligo-array based approach has several substantial advantages over other technologies, in that it allows the designer to position the probes anywhere within the genomic or polynucleotide sequence of interest. The probes can be placed at any set of loci or positioned to span any genomic intervals of interest at whatever density is commensurate with the real estate or area available on the microarray (in terms of number of features). The copy numbers of DNA over the genomic regions of interest can be evaluated by analyzing the hybridization of target sequences to the surface-bound probes. The oligonucleotide probe approach also offers the flexibility of focusing in on regions within exons or introns of expressed sequences, including pre-microRNAs or intergenic regions and regulatory regions for location analysis, as well as any desirable admixture of the aforementioned.

Probes that work well on microarrays for gene expression generally do not work well for CGH arrays and are not appropriate for location analysis arrays. The overall performance of probes for CGH and location analysis arrays entails different optimization of their properties than probes utilized for gene expression. Most notably, these differences relate to the substantially increased complexity of the labeled target mixture for CGH and location analysis compared with expression analysis, which demands greater specificity of the probes in discriminating against non-specific binding to competing targets. For comparison, the total number of nucleotide bases in the human transcriptome is approximately 10⁸, while the human genome contains over 3×10⁹ bases. Additionally, probes selected for gene expression come from within message sequences that are transcribed as RNA, i.e., exons, while probes for CGH need to be complementary, or nearly so, to contiguous targets selected from within a genome sequence, e.g., introns and/or exons.

Despite great interest in CGH technology, methods for evaluating probes in silico and also empirically for use in this technology are limited. A rigorous method would be to measure signals (e.g., ratios of signals) from each polynucleotide in controlled experiments with test samples containing known copy numbers for each probe sequence on the array. For example, a method used by several probe designers for measuring array performance for sets of polynucleotides specific for sequences on the X chromosome is to use a series of cell lines with known variable copies of the X chromosome for CGH experiments. See, e.g., M. T. Barrett et al., Proc. Natl. Acad. Sci. USA 101(51): 17765-70 (2004). These cell lines (X series) are homogeneous and contain intact copies (e.g., 1 to 5) of the X chromosome, permitting a rigorous measure of the relationship between copy number and signal intensities for each X chromosome specific polynucleotide on an array. However, cell lines containing known variable numbers of intact copies of most other chromosomes are not readily available. Furthermore, the aberrant X series cell lines are slow growing and can spontaneously vary in ploidy under standard culturing conditions. Such methods are complex and time-consuming and cannot readily be used to assay the relationship between the hybridization signal of polynucleotides on an array and the genomic copy number of sequences from each chromosome in a cell.

SUMMARY

This disclosure relates to methods for predicting probe performance for microarray applications. The methods described herein optimize probe performance by measuring the probe response in a model system and applying that response to predict the response for probes that have not yet been experimentally tested.

Methods for selecting an oligonucleotide probe with the best performance in a microarray application are provided herein. In an aspect, the methods include generating candidate probes and screening the probes with one or more metrics or parameters that can predict or classify probe performance. The resulting probe scores for each metric are combined using various statistical methods, and the probe with the best combined score is selected. In aspects, the methods described herein can be modified to obtain probes whose predicted properties fall within a very narrow range.

Algorithms for performing the described methods recorded on a computer-readable medium, as well as computational analysis systems that include the same, are also provided. The disclosure also includes nucleic acid arrays with oligonucleotide probes whose performance is predicted using the subject methods, and methods using such arrays.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart generally depicting the methods described herein.

FIG. 2 is a flowchart showing a method to generate candidate probes whose performance is predicted by the methods described herein.

FIG. 3 is a flowchart depicting methods for calculating slope and combining measured slope with calculated parameters to predict probe performance.

FIG. 4 shows a distribution of measured slope against duplex melting temperature.

FIG. 5 shows a plot of smoothed measured slope against duplex melting temperature.

FIG. 6 shows a trend curve with a fitted curve of a 12th-order polynomial vs. duplex melting temperature.

FIG. 7 shows a graph of the fitted slopes vs. measured slopes for combined metrics.

FIG. 8 shows measured slope plotted against duplex melting temperature trend curves and various synthetic replacement curves.

FIG. 9 shows T_(m) distributions resulting from the use of combined synthetic and empirical scores.

DETAILED DESCRIPTION

Various embodiments will be described in detail with reference to the drawings, wherein like reference numerals represent like parts throughout the several views. Reference to various embodiments does not limit the scope of the claims attached hereto. Additionally, any examples set forth in this specification are not intended to be limiting and merely set forth some of the many possible embodiments for the claims.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Although any methods, devices and materials similar or equivalent to those described herein can be used in practice or testing, the methods, devices and materials are now described.

All publications and patent applications in this specification are indicative of the level of ordinary skill in the art and are incorporated herein by reference in their entireties.

In this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural reference, unless the context clearly dictates otherwise.

Definitions

The term “genome” refers to all nucleic acid sequences (coding and non-coding) and elements present in or originating from a single cell, or from each cell type in an organism, or from a virus. The term “genome” encompasses all sources of genomic sequences or elements known to those of skill in the art. The term genome also applies to any naturally occurring or induced variation of these sequences that may be present in a mutant or disease variant of any virus or cell type. These sequences include, but are not limited to, those involved in the maintenance, replication, segregation, and higher order structures (e.g., folding and compaction of DNA in chromatin and chromosomes), or other functions, if any, of the nucleic acids as well as all the coding regions and their corresponding regulatory elements needed to produce and maintain each particle, cell or cell type in a given organism.

For example, the human genome consists of approximately 3×10⁹ base pairs of DNA organized into distinct chromosomes. The genome of a normal diploid somatic human cell consists of 22 pairs of autosomes (chromosomes 1 to 22) and either chromosomes X and Y (male) or a pair of X chromosomes (female) for a total of 46 chromosomes. A genome of a cancer cell may contain variable numbers of each chromosome in addition to deletions, rearrangements and amplification of any subchromosomal region or DNA sequence.

The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, usually up to about 10,000 or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, or compounds produced synthetically (e.g., PNA as described in U.S. Pat. No. 5,948,902 and the references cited therein) which can hybridize with naturally occurring nucleic acids in a sequence specific manner analogous to that of two naturally occurring nucleic acids, e.g., can participate in Watson-Crick base pairing interactions.

The terms “ribonucleic acid” and “RNA” as used herein mean a polymer composed of ribonucleotides.

The terms “deoxyribonucleic acid” and “DNA” as used herein mean a polymer composed of deoxyribonucleotides.

The term “oligonucleotide” as used herein denotes single stranded nucleotide multimers of from about 10 to 100 nucleotides and up to 200 nucleotides in length. Oligonucleotides are usually synthetic and, in many embodiments, are under 50 nucleotides in length.

The term “oligomer” is used herein to indicate a chemical entity that contains a plurality of nucleotide monomers, i.e., a nucleotide multimer. As used herein, the terms “oligomer” and “polymer” are used interchangeably, as it is generally, although not necessarily, smaller “polymers” that are prepared using the functionalized substrates of the invention, particularly in conjunction with combinatorial chemistry techniques. Examples of oligomers and polymers include polydeoxyribonucleotides (DNA), polyribonucleotides (RNA), other nucleic acids that are C-glycosides of a purine or pyrimidine base, polypeptides (proteins), polysaccharides (starches, or polysugars), and other chemical entities that contain repeating units of like chemical structure.

The term “sample” as used herein relates to a material or mixture of materials, typically, although not necessarily, in fluid form, containing one or more components of interest. Samples include, but are not limited to, biological samples obtained from natural biological sources, such as cells or tissue. The samples may also be derived from tissue biopsies and other clinical procedures.

The terms “nucleoside” and “nucleotide” are intended to include those moieties that contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the terms “nucleoside” and “nucleotide” include those moieties that contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.

The phrase “surface-bound polynucleotide” refers to a polynucleotide that is immobilized on a surface of a solid substrate, where the substrate can have a variety of configurations, e.g., a sheet, bead, or other structure. In certain embodiments, the collections of oligonucleotide probe elements employed herein are present on a surface of the same planar support, e.g., in the form of an array.

The phrase “labeled population of nucleic acids” refers to a mixture of nucleic acids that are detectably labeled, e.g., fluorescently labeled, such that the presence of the nucleic acids can be detected by assessing the presence of the label. If a labeled population of nucleic acids is “made from” a chromosome sample, the chromosome sample is usually employed as a template for making the population of nucleic acids.

A “biological model system,” or “model system,” as provided herein, refers to a system for which a quantitative response in a microarray system can be expected with certainty (i.e., a system wherein a response can be detected or measured). Exemplary model systems include, without limitation, biological systems, such as titration series with different RNA samples at different concentrations, samples with known genomic aberrations, samples to be used for comparative genomic hybridization experiments, etc. The biological model systems are used to perform microarray experiments, to validate probes designed for microarray applications, to obtain sets of training data for statistical analysis, etc.

The term “array” encompasses the term “microarray” and refers to an ordered array presented for binding to nucleic acids and the like.

An “array” includes any two-dimensional or substantially two-dimensional (as well as a three-dimensional) arrangement of spatially addressable regions bearing nucleic acids, particularly oligonucleotides or synthetic mimetics thereof, and the like. Where the arrays are arrays of nucleic acids, the nucleic acids may be adsorbed, physisorbed, chemisorbed, or covalently attached to the arrays at any point or points along the nucleic acid chain.

In those embodiments where an array includes two or more features immobilized on the same surface of a solid support, the array may be referred to as addressable. An array is “addressable” when it has multiple regions of different moieties (e.g., different oligonucleotide sequences) such that a region (i.e., a “feature” or “spot” of the array) at a particular predetermined location (i.e., an “address”) on the array will detect a particular sequence. Array features are typically, but need not be, separated by intervening spaces. In the case of an array in the context of the present application, the “population of labeled nucleic acids” will be referenced as a moiety in a mobile phase (typically fluid), to be detected by “surface-bound polynucleotides” which are bound to the substrate at the various regions. These phrases are synonymous with the arbitrary terms “target” and “probe”, or “probe” and “target”, respectively, as they are used in other publications.

A “scan region” refers to a contiguous (preferably, rectangular) area in which the array spots or features of interest, as defined above, are found or detected. Where fluorescent labels are employed, the scan region is that portion of the total area illuminated from which the resulting fluorescence is detected and recorded. Where other detection protocols are employed, the scan region is that portion of the total area queried from which resulting signal is detected and recorded. For the purposes of this invention and with respect to fluorescent detection embodiments, the scan region includes the entire area of the slide scanned in each pass of the lens, between the first feature of interest, and the last feature of interest, even if there are intervening areas that lack features of interest.

The term “substrate” as used herein refers to a surface upon which marker molecules or probes, e.g., an array, may be adhered. Glass slides are the most common substrate for biochips, although fused silica, silicon, plastic, flexible web and other materials are also suitable.

An “array layout” refers to one or more characteristics of the features, such as feature positioning on the substrate, one or more feature dimensions, and an indication of a moiety at a given location.

“Hybridizing” and “binding”, with respect to nucleic acids, are used interchangeably. The terms “hybridizing,” “hybridizing specifically to,” and “specific hybridization” as used herein, refer to the binding, duplexing, or hybridizing of a nucleic acid molecule preferentially to a particular nucleotide sequence under stringent conditions.

The term “stringent assay conditions” as used herein refers to conditions that are compatible to produce binding pairs of nucleic acids, e.g., probes and targets, of sufficient complementarity to provide for the desired level of specificity in the assay while being incompatible to the formation of binding pairs between binding members of insufficient complementarity to provide for the desired specificity. The term stringent assay conditions refers to the combination of hybridization and wash conditions.

“Stringent hybridization” and “stringent hybridization wash conditions” in the context of nucleic acid hybridization (e.g., as in array, Southern or Northern hybridizations) are sequence dependent, and are different under different environmental parameters. Stringent hybridization conditions that can be used to identify nucleic acids within the scope of the invention can include, e.g., hybridization in a buffer comprising 50% formamide, 5×SSC, and 1% SDS at 42° C., or hybridization in a buffer comprising 5×SSC and 1% SDS at 65° C., both with a wash of 0.2×SSC and 0.1% SDS at 65° C. Exemplary stringent hybridization conditions can also include a hybridization in a buffer of 40% formamide, 1 M NaCl, and 1% SDS at 37° C., and a wash in 1×SSC at 45° C. Alternatively, hybridization to filter-bound DNA in 0.5 M NaHPO₄, 7% sodium dodecyl sulfate (SDS), 1 mM EDTA at 65° C., and washing in 0.1×SSC/0.1% SDS at 68° C. can be employed. Yet additional stringent hybridization conditions include hybridization at 60° C. or higher and 3×SSC (450 mM sodium chloride/45 mM sodium citrate) or incubation at 42° C. in a solution containing 30% formamide, 1 M NaCl, 0.5% sodium sarcosine, 50 mM MES, pH 6.5. Those of ordinary skill will readily recognize that alternative but comparable hybridization and wash conditions can be utilized to provide conditions of similar stringency.

In certain embodiments, the stringency of the wash conditions determines whether a nucleic acid is specifically hybridized to a probe. Wash conditions used to identify nucleic acids may include, e.g.: a salt concentration of about 0.02 M at pH 7 and a temperature of about 20° C. to about 40° C.; or, a salt concentration of about 0.15 M NaCl at 72° C. for about 15 minutes; or, a salt concentration of about 0.2×SSC at a temperature of about 30° C. to about 50° C. for about 2 to about 20 minutes; or, the hybridization complex is washed twice with a solution with a salt concentration of about 2×SSC containing 1% SDS at room temperature for 15 minutes and then washed twice by 0.1×SSC containing 0.1% SDS at 37° C. for 15 minutes; or, equivalent conditions. Stringent conditions for washing can also be, e.g., 0.2×SSC/0.1% SDS at 42° C. See Sambrook, Ausubel, or Tijssen (cited below) for detailed descriptions of equivalent hybridization and wash conditions and for reagents and buffers, e.g., SSC buffers and equivalent reagents and conditions.

A specific example of stringent assay conditions is rotating hybridization at 65° C. in a salt based hybridization buffer with a total monovalent cation concentration of 1.5 M (e.g., as described in U.S. patent application Ser. No. 09/655,482 filed on Sep. 5, 2000, the disclosure of which is herein incorporated by reference) followed by washes of 0.5×SSC and 0.1×SSC at room temperature.

Stringent hybridization conditions may also include a “prehybridization” of aqueous phase nucleic acids with complexity-reducing nucleic acids to suppress repetitive sequences. For example, certain stringent hybridization conditions include, prior to any hybridization to surface-bound polynucleotides, hybridization with Cot-1 DNA, or the like.

Stringent assay conditions are hybridization conditions that are at least as stringent as the above representative conditions, where a given set of conditions is considered to be at least as stringent if substantially no additional binding complexes that lack sufficient complementarity to provide for the desired specificity are produced in the given set of conditions as compared to the above specific conditions, where by “substantially no additional” is meant less than about 5-fold more, typically less than about 3-fold more. Other stringent hybridization conditions are known in the art and may also be employed, as appropriate.

The term “mixture”, as used herein, refers to a combination of elements that are interspersed and not in any particular order. A mixture is heterogeneous and not spatially separable into its different constituents. Examples of mixtures of elements include a number of different elements that are dissolved in the same aqueous solution, or a number of different elements attached to a solid support at random or in no particular order in which the different elements are not especially distinct. In other words, a mixture is not addressable. To be specific, an array of surface-bound polynucleotides, as is commonly known in the art and described below, is not a mixture of capture agents because the species of surface-bound polynucleotides are spatially distinct and the array is addressable.

“Isolated” or “purified” generally refers to isolation of a substance (compound, polynucleotide, protein, polypeptide, chromosome, etc.) such that the substance comprises the majority percent of the sample in which it resides. Typically in a sample a substantially purified component comprises 50%, preferably 80%-85%, more preferably 90-95% of the sample. Techniques for purifying polynucleotides, polypeptides and intact chromosomes of interest are well-known in the art and include, for example, ion-exchange chromatography, affinity chromatography, sorting, and sedimentation according to density.

The terms “assessing” and “evaluating” are used interchangeably to refer to any form of measurement, and include determining if an element is present or not. The terms “determining,” “measuring,” “assessing,” and “assaying” are used interchangeably and include both quantitative and qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, as well as determining whether it is present or absent.

The term “using” has its conventional meaning, and, as such, means employing, e.g., putting into service, a method or composition to attain an end. For example, if a program is used to create a file, a program is executed to make a file, the file usually being the output of the program. In another example, if a computer file is used, it is usually accessed, read, and the information stored in the file employed to attain an end. Similarly, if a unique identifier, e.g., a barcode, is used, the unique identifier is usually read to identify, for example, an object or file associated with the unique identifier.

If a surface-bound polynucleotide “corresponds to” a chromosome, the polynucleotide usually contains a sequence of nucleic acids that is unique to that chromosome. Accordingly, a surface-bound polynucleotide that corresponds to a particular chromosome usually specifically hybridizes to a labeled nucleic acid made from that chromosome, relative to labeled nucleic acids made from other chromosomes. Array features, because they usually contain surface-bound polynucleotides, can also correspond to a chromosome.

A “non-cellular chromosome composition”, as will be discussed in greater detail below, is a composition of chromosomes synthesized by mixing pre-determined amounts of individual chromosomes. These synthetic compositions can include selected concentrations and ratios of chromosomes that do not naturally occur in a cell, including any cell grown in tissue culture. Non-cellular chromosome compositions may contain more than an entire complement of chromosomes from a cell, and, as such, may include extra copies of one or more chromosomes from that cell. Non-cellular chromosome compositions may also contain less than the entire complement of chromosomes from a cell.

A “probe” means a polynucleotide which can specifically hybridize to a target nucleotide, either in solution or as a surface-bound polynucleotide.

The term “validated probe” means a probe that has been passed by at least one screening or filtering process in which experimental data related to the performance of the probes was used as part of the selection criteria.

“In silico” means those parameters that can be determined without the need to perform any experiments, by using information either calculated de novo or available from public or private databases.

The term “duplex T_(m)” refers to the melting temperature of two oligonucleotides which have formed a duplex structure. Duplex T_(m) is calculated by a simple formula where each matching GC pair gets a value of 2, and each matching AT pair gets a value of 1. The sum of these approximate values gives the melting temperature.
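By way of illustration only, the simple formula in the preceding definition can be sketched in a few lines of Python. The function name `duplex_tm` is hypothetical, and the sketch assumes a perfectly matched duplex, so that every base of the probe contributes one matching pair; it returns the approximate value given by the definition, not a calibrated temperature.

```python
# Illustrative sketch only: per-pair values follow the definition above
# (matching GC pair = 2, matching AT pair = 1). The function name and the
# assumption of a fully matched duplex are not taken from the specification.
PAIR_VALUE = {"G": 2, "C": 2, "A": 1, "T": 1}


def duplex_tm(probe: str) -> int:
    """Return the approximate duplex T_(m) value for a fully matched probe."""
    seq = probe.upper()
    unknown = set(seq) - set(PAIR_VALUE)
    if unknown:
        raise ValueError(f"unsupported bases: {sorted(unknown)}")
    # Sum the per-pair values over every base of the probe.
    return sum(PAIR_VALUE[base] for base in seq)


example_value = duplex_tm("GATTACA")  # 2 + 1 + 1 + 1 + 1 + 2 + 1 = 9
```

Under these assumptions, a GC-rich probe receives a higher value than an AT-rich probe of the same length, which is the property later used to compare probes by duplex T_(m).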

Approaches and Methods for Probe Selection

The present methods provide alternative and novel methods and systems for designing probes for CGH and location analysis in microarray applications that overcome the drawbacks of existing microarray probe selection techniques. General methods that utilize probe/target hybridization experiments and/or unique data analysis techniques to identify and select nucleotide probe(s) targeting polynucleotide fragments in a region of interest were described in U.S. Patent Publication No. 2006/0110744. The methods described herein provide statistical methods for combining and modifying probe scores in order to achieve desired results, such as selecting or designing probes with more robust probe performance, better probe signal, etc.

The present description provides methods, systems and computer readable media for identifying and selecting nucleic acid probes for detecting a target with a nucleic acid probe array or microarray. The methods comprise, in general terms: the selection of genomic nucleotide ranges of interest, determining appropriate target sequences for CGH and/or location analysis, generating candidate probes specific for the target sequences and analyzing candidate probes for specific probe properties by computational and/or experimental processes to optimize probe selection and reduce the number of probes to a value appropriate for placement on a microarray.

The description also provides microarrays comprising probes selected by the methods described herein. The microarrays comprise a solid support and a plurality of surface bound probes, the surface bound probes having very similar thermodynamic properties as well as similar GC content. More specifically, a large portion of the probes utilized in the microarrays of the invention have duplex melting temperatures (T_(m)) which are within a narrow temperature range compared to the T_(m) range of probes for other microarray systems, such as arrays for gene expression.

The methods provided herein are particularly useful with comparative genome hybridization microarrays, such as microarrays based on the human or mouse genome. These methods permit more cost-effective and efficient identification of gene regions or sections which can be associated with human disease, points of therapeutic intervention, and potential toxic side-effects of proposed therapeutic entities.

In general terms, the methods for probe selection and validation described herein comprise identifying probe properties that can be determined a priori by the probe's sequence and the sequence of the genome it is contained within, and may further comprise expanding the set of properties from those that can be determined a priori, to those that can be measured empirically through simple experiments, such as self-self experiments. The described methods may further comprise measuring the response of candidate probes to a known stimulus, where the stimulus is generated by a set of samples where the copy numbers for relatively small subsets of the genome are altered in known ways.

In designing an array comprising high-performance probes that comprehensively covers a whole genome (e.g., the human genome), the entire genomic sequence must be searched when generating specific candidate probes. This homology search is potentially the most time-consuming part of the probe design process. Ideally, a homology search would be the first part of the process; however, because of the scale of the human genome, executing an exhaustive search of all possible short oligo probes (<100 bases) can take computation time on the scale of a CPU year (based on ProbeSpec) for modern 3 GHz processors. This computation time can be reduced by any of a number of methods, most involving reducing the scale of the search. For example, known highly repetitive sequences can be removed by a process called RepeatMasking. Repeat-masked genomic sequences are publicly available on the web (e.g., UCSC's www.genomebrowser.org). Another approach is to reduce the number of probe sequences being searched up-front. This can be done on the basis of any known property of the probe, from thermodynamic properties, such as duplex T_(m) and hairpin free energy, to position on the genome. The present description provides methods which apply known probe information as a screening process to reduce the number of probe sequences to be analyzed in a homology search, thus reducing the computation time needed to identify appropriate probes for a CGH based array.
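The up-front screening described in the preceding paragraph may be sketched, purely by way of example, as a filter applied to candidate probes before any homology search is run. The function names, the property used for screening, and the window limits below are all assumptions chosen for illustration; the homology search itself is not shown.

```python
from typing import Callable, Iterable, List


def prefilter_candidates(
    candidates: Iterable[str],
    property_fn: Callable[[str], float],
    low: float,
    high: float,
) -> List[str]:
    """Keep only candidates whose in silico property lies within [low, high].

    Hypothetical helper: only the survivors would be passed on to the
    (expensive) homology search.
    """
    return [probe for probe in candidates if low <= property_fn(probe) <= high]


def simple_tm(probe: str) -> int:
    # Assumed screening property: GC pair = 2, AT pair = 1, fully matched duplex.
    values = {"G": 2, "C": 2, "A": 1, "T": 1}
    return sum(values[base] for base in probe.upper())


candidates = ["ATATATAT", "GCGCGCGC", "ATGCATGC"]  # toy sequences, values 8, 16, 12
survivors = prefilter_candidates(candidates, simple_tm, low=10, high=14)
# survivors == ["ATGCATGC"]
```

Because the property is computed from the probe sequence alone, the filter costs time proportional to the number of candidates, whereas each homology search must scan the genome; any other known property (for example, hairpin free energy or genomic position) could be supplied as `property_fn` in the same way.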

The present systems, techniques, methods and computer readable media also provide for streamlined workflow, since researchers need only to prepare and process one microarray instead of two or more per sample, with fewer steps in processing and tracking required.

Further, greater reproducibility of results is provided for, since all data for an entire genome is generated from a single microarray, resulting in less variability in the data. When two or more microarrays associated with the same sample are processed separately, there are always questions of variability of the experimental conditions used to process each microarray.

Designing a microarray involves determining the amount of “real estate” (number of probes) that is available for the final array. The array designer also determines the amount of probes or “real estate” to use for specified regulatory regions and intergenic regions, as well as the amount of probes necessary to adequately cover introns and exons of the chromosomes of interest. Initially, a designer will generate 20 to 40 million candidate probes and will need to filter the probes for certain probe properties or parameters to obtain a final array with approximately 40,000 probes. Intermediate arrays are manufactured in some embodiments of the methods of the invention, which have a redundancy of 3 or 4 fold over the number of probes selected for the final array. These intermediate arrays are utilized to screen candidate probes for certain probe properties by direct or indirect experimentation.

In many embodiments, the oligonucleotides (i.e., probes) contained in the features of the invention have been designed according to one or more particular parameters to be suitable for use in a given application, where representative parameters include, but are not limited to: length, melting temperature (T_(m)), non-homology with other regions of the genome, hybridization signal intensities, kinetic properties under hybridization conditions, etc., see e.g., U.S. Pat. No. 6,251,588, the disclosure of which is herein incorporated by reference.

Standard hybridization techniques (using high stringency hybridization conditions) are used to probe the subject array. Suitable methods are described in references describing CGH techniques (Kallioniemi et al., Science 258:818-821 (1992) and WO 93/18186). Several guides to general techniques are available, e.g., Tijssen, Hybridization with Nucleic Acid Probes, Parts I and II (Elsevier, Amsterdam 1993). For descriptions of techniques suitable for in situ hybridizations see Gall et al., Meth. Enzymol. 21:470-480 (1981) and Angerer et al. in Genetic Engineering: Principles and Methods (Setlow and Hollander, eds.), vol. 7, pp. 43-65 (Plenum Press, New York 1985). See also U.S. Pat. Nos. 6,335,167; 6,197,501; 5,830,645; and 5,665,549; the disclosures of which are incorporated herein by reference.

FIG. 1 shows a general description of the methods described herein. In an aspect, as in operation 100, a candidate oligonucleotide for a particular region of interest in a target nucleic acid sequence is generated. The candidate probe is then screened with one or more metrics or parameters that are predictive of probe performance, as in operation 102, which yields a probe score for each metric. The individual probe scores are then combined to produce a combined score for the probe in operation 103. The probe with the best score is then selected, as in 104, for a subsequent microarray application.

Methods for Selecting Oligonucleotide Probes

The methods described herein are directed to selection of oligonucleotide probes for use in microarray applications. Two or more candidate oligonucleotide probes are generated and analyzed using one or more metrics that are indicative of probe performance. An individual probe score is obtained with respect to each metric, and these probe scores are then combined into a single score for the probe. Probes with combined scores closest to an optimal score value are selected as ideal or best probes (i.e. those probes which are most suited to a particular microarray experiment, in terms of ability to hybridize to the target sequences, reproducibility, repeatability, etc.). Probes may be scored on any numerical scale, with the best probes having scores closest to the high end of the numerical scale. For example, an optimal score value on a scale of 0.0 to 1.0 would be about 1.0. Similarly, on a numerical scale of probe scores from 50 to 100, an optimal score value would be about 100.

In embodiments, two or more candidate probes are generated by selecting one or more target sequences within a region of interest and tiling subsequences of the target across the entire region of interest to obtain a set of potential probes. In aspects, the subsequences of the target sequences are tiled in single base steps across the region of interest. This generates a large set of potential probes, which is reduced to a manageable number (such as greater than 2, but less than 1000, for example) by pairwise filtering.

Once the candidate probes are generated, they are analyzed using metrics that are indicative of probe performance, and each probe is assigned a probe score. These metrics include direct metrics, indirect metrics and in silico metrics. Direct metrics comprise the changes in probe response based on experimentally measured quantities, such as change in copy number. Indirect metrics comprise changes in predicted probe response resulting from experimentally measured quantities for a target molecule, or changes in predicted probe response measured using empirical relationships based on direct responses from other probe-target molecule duplexes. The in silico metrics comprise changes in the probe response based on calculated quantities for a target molecule, or changes in probe response measured using empirical relationships based on direct responses from other probe-target molecule duplexes. To obtain a probe score from the application of one or more metrics, the slope for each candidate probe is calculated and plotted against the corresponding value for each metric to generate a trend curve. The trend curve is then fitted with a polynomial function to obtain the probe score for each metric. The order of the polynomial can range from 1 to 20.

Individual probe scores for each metric are then combined, by adding or averaging the individual scores, to give a combined probe score. The individual probe scores can also be fitted with a linear additive multivariate fitting function, or a linear multiplicative fitting function, to give the combined score. In aspects, combined probe scores are obtained by combining the metrics in each category using a linear model to obtain intermediate scores. The intermediate scores are then multiplied together to give the combined score. Individual probe scores can also be combined by fitting the measured slope responses for a training data set with a change in copy number.
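As an illustration only, the combination of per-metric scores into intermediate (per-category) scores and then into a combined score might be sketched as follows. The function name, the category and metric labels, and the use of a weighted average as the linear model within each category are assumptions made for this sketch and are not prescribed by the methods described herein.

```python
from math import prod


def combine_scores(scores_by_category, weights=None):
    """Combine per-metric scores (each on a 0..1 scale) into one probe score.

    scores_by_category maps a category label (e.g. "direct", "indirect",
    "in_silico") to a dict of {metric label: score}. weights has the same
    shape; any metric without a weight defaults to 1.0 (assumption).
    """
    weights = weights or {}
    intermediates = []
    for category, metric_scores in scores_by_category.items():
        if not metric_scores:
            continue  # nothing measured or calculated for this category
        w = weights.get(category, {})
        total_weight = sum(w.get(m, 1.0) for m in metric_scores)
        # Linear model within a category: weighted average of its scores.
        weighted_sum = sum(w.get(m, 1.0) * s for m, s in metric_scores.items())
        intermediates.append(weighted_sum / total_weight)
    # Intermediate scores are multiplied together to give the combined score.
    return prod(intermediates)
```

With equal weights, a probe scoring 0.8 on one direct metric and 0.5 and 1.0 on two in silico metrics would receive 0.8 × 0.75 = 0.6 under this sketch.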

The combined scores can be synthetically modified to give probes with more robust predicted performance (i.e. a probe which more effectively mimics probe performance in an actual experiment). In aspects, the synthetic modification comprises generating a large candidate set of probes, and reducing the number of probes by pairwise reduction. A slope is calculated for each probe and plotted against the corresponding slope for each metric to generate a trend curve. This trend curve is fitted to give a measured probe score, and the fitted trend curve is replaced using a synthetic curve and a predicted score is obtained. The predicted scores are combined with experimentally measured scores to give the combined score value for a particular probe. The probe (or probes) with a combined score value closest to the optimal score value is selected for microarray applications.

Probe selection is performed using a computational analysis system which comprises a computer-readable medium with a program that selects probes for microarray applications as in the methods described herein. The methods can be used to produce or fabricate a microarray comprising at least two probes selected according to the methods described herein.

Generating Candidate Probes

In an embodiment, a candidate oligonucleotide probe, or set of probes, for a particular region of interest in a target nucleic acid sequence is generated, as in operation 100, an expanded representation of which is shown in FIG. 2. Briefly, operation 100 begins with the selection or identification 200 of target nucleic acid sequences within a genome. The candidate probe or candidate set of probes is any probe or set of probes within (or capable of hybridizing to) the target sequence or genome including, without limitation, genes, exons, mRNA, a region of interest within the target sequence, probes used or selected for previous experiments, upstream or downstream regulatory regions of genes, methylated regions, regions associated with putative SNPs or CNPs, sequence aberrations known to be associated with particular disease states or phenotypes, histones or binding sites in the sequence for other molecules, etc. Potential target sequences of the nucleotide sample of interest are identified, filtered and reduced to a set of appropriate target sequences for CGH and/or location analysis. The potential target sequences are filtered by size, number of repeat-masked bases and/or GC-content. Target sequences are also filtered and reduced in number by eliminating repetitive target sequences. Target sequences can also be filtered by eliminating potential target sequences which comprise a restriction enzyme cut site. By limiting the size of the set of target sequences, the computational time needed to generate and analyze the candidate probes is decreased.

Generating a set of candidate probes comprises selecting subsequences of the selected target nucleic acid sequences across genomic regions of interest, as in operation 202. Probes are tiled in uniform or moderately uniform spacing, in steps as small as a single base, or as large as megabases, through the genome, targeted region of the genome, or target nucleic acid sequence. For example, probes may be tiled in steps of 50-100 bases across the entire genome, but the methods described herein are not dependent on the scale of the tiling. The smaller the scale of the tiling, the larger the number of potential candidate probes forming a plurality of candidate probes. The number of potential candidate probes over an interval should exceed the number expected to be selected over that same interval. Candidate probes are selected from the plurality of candidate probes based on parameters, for example, a narrow range of a specific parameter such as probe length. Probe parameters utilized to select candidate probes from a plurality of potential candidate probes may include, but are not limited to, target specificity, thermodynamic properties, expression and association with genes, homology and kinetic properties.

In some embodiments, the probe parameters include, but are not limited to, a range of T_(m) of about 0.25° C. to about 5° C., a T_(m) value of about 65° C. to about 85° C., a nucleotide length of 20 to 200 nucleotides, a range of % GC content of less than 10%, and/or a % GC content of about 30-40%. When the length of the probe is a criterion, probes have a nucleotide length of about 20 nucleotides to about 200 nucleotides, usually about 40 nucleotides to 100 nucleotides, and more usually 50 to 65 nucleotides.

Typically, 30 to 60-mer candidate probes are selected, but the candidate probes may range from about 20-mer to about 200-mer. Typically, probes may be selected over spacings of approximately half the length of the probe. For example, for a 60-mer candidate probe, 30 bp intervals would be selected over the entire genome, or regions of interest. Usually, the repeat-masked regions are skipped, as they are usually insufficiently unique to be of use. Also, if the assay involves the use of a restriction digest, the restriction sites within the sequence for the restriction enzymes specified within the protocol are also typically excluded, or those probes subsequently excluded from the candidate set.
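By way of a non-limiting sketch, the tiling described above (e.g. 60-mer windows at 30 bp intervals, skipping repeat-masked regions and restriction sites) could be expressed as follows. The function name and parameters are hypothetical, and the sketch assumes that repeat-masked bases are represented as lowercase letters (soft-masking) and that restriction sites are supplied as uppercase recognition sequences.

```python
def tile_candidates(sequence, probe_len=60, step=30, cut_sites=()):
    """Yield (start, probe) for windows tiled across `sequence`.

    Assumptions for this sketch: repeat-masked bases are lowercase
    (soft-masked), and cut_sites holds uppercase recognition sequences
    of the restriction enzymes named in the assay protocol.
    """
    for start in range(0, len(sequence) - probe_len + 1, step):
        probe = sequence[start:start + probe_len]
        if any(base.islower() for base in probe):
            continue  # window overlaps a repeat-masked region
        if any(site in probe for site in cut_sites):
            continue  # window contains a restriction site
        yield start, probe
```

A smaller `step` yields more candidate probes per interval, consistent with the requirement that the number of candidates over an interval exceed the number expected to be selected there.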

The large number of candidate probes generated by this process is then reduced to a smaller set of candidate probes using a reduction method, such as the pairwise reduction method, for example, as shown in operation 204. The pairwise reduction method evaluates a pair of candidate probes for a probe property and scores the probes within the pair against each other according to the probe property analyzed. The pairwise reduction process reduces the number of probes by a factor of X, where X may be any number that significantly reduces the number of probes. For example, the pairwise reduction process may reduce the number of probes by a factor of 5, 10, 15, 20, 25, 30 and so on. The number of candidate probes can also be reduced by any other method or algorithm that uses the position of the probe and a combined or overall score to discriminate between probes.

Following the reduction process, the candidate probes are optionally experimentally validated, as in 206. The experimental validation process involves experiments which measure the properties of a probe that provide a good indication of the probe's performance (i.e., suitability) in a microarray experiment, in the absence of direct experiments or data. Experimentally measurable probe properties include, without limitation, raw signal intensity, reproducibility of signal intensity, dye bias, susceptibility to non-specific binding, etc. The processes for probe selection, pairwise reduction and experimental validation are described in detail in U.S. Patent Publication No. 2006/0110744 and WO 2004/059845, the disclosures of which are incorporated herein by reference.

Once at least two candidate probes are selected, the candidate probes are analyzed with one or more metrics that predict or indicate probe performance to generate a probe score for each metric.

Analyzing Candidate Probes with Metrics

As shown in FIG. 1, embodiments of the methods described herein include a process 102 for analyzing candidate probes with one or more probe performance parameters or metrics. The term “parameter” or “metric” refers to a quantity or property that is indicative of a probe's performance in a microarray experiment. Three types of metrics are used with the methods described herein: direct metrics, indirect metrics, and in silico metrics.

Direct metrics are those that directly measure probe performance. Direct metrics measure performance by observing the change in probe response, as measured by the signal or ratio of signals, or log of the ratio of signals, with respect to a reference sample, with a change in copy number of the target nucleic acid sequence or region of interest, using multiple hybridization experiments on multiple arrays, with the conditions maintained as similar as possible between arrays. The change in probe response is measured as a change in signal in a differential model system, such as a dye-swap or dye-flip experiment, for example, where DNA copy number is changed in known or predictable ways. For example, in an experiment to evaluate the performance of probes on the X chromosome using a normal pair of female-male samples, the probe is expected to produce a 2:1 signal ratio, as there are twice as many X chromosome target molecules in the female sample as in the male sample, and there are no Y chromosome target molecules in the female sample. Similarly, other differential model systems such as cell lines with well-known chromosomal aberrations, extra or missing chromosomes or regions of chromosomes, will also produce copy number changes in a predictable manner. Biological model systems that do not exist in nature, but are created using cell sorting techniques, or by mixing collections of BACs, cDNAs or other biologically-derived DNA samples, can also be used for measuring probe performance.

Indirect metrics are measured parameters or metrics that are indicative of probe performance. Indirect metrics comprise observing the change in probe response in relatively simple experiments using a non-differential model. Indirect metrics (or indirect empirical parameters) include, for example, signal strength (in one or both channels from which signal is measured in a microarray experiment), dye bias (the LogRatio associated with a dye label rather than the LogRatio associated with copy number), and differential signals obtained from experiments under various conditions (multiple annealing times for probes to target nucleic acid sequences, wash times, wash temperatures, etc.). For dye bias measurements, the LogRatios for dye-flip experiments are averaged, rather than subtracted as they would be to calculate the effective LogRatios for copy number changes. Experiments using indirect metrics are considered non-differential, because, for most of the genome, the changes in probe response do not reflect changes in copy number (i.e. no change in copy number is expected between the sample and a reference sequence). Rather, indirect metrics predict the performance of a probe in terms of sensitivity and specificity. For example, a measured signal that is too strong could represent cross-hybridization of the probes to multiple regions of the genome. On the other hand, a measured signal that is too weak may be indicative of noise, or may be susceptible to changes in the condition or the quantity of DNA.
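The distinction between averaging and subtracting LogRatios for a dye-flip pair can be illustrated with a short sketch. The function name is hypothetical, and the division by two in the subtracted term (so that the result is on the scale of a single LogRatio) is a convention assumed here rather than one stated in the text.

```python
def dye_flip_summary(log_ratio_forward, log_ratio_flipped):
    """Separate the copy-number LogRatio from dye bias for a dye-flip pair.

    Both inputs are log(red/green) for one probe; in the flipped array
    the dye labels of the sample and the reference are exchanged.
    """
    # The copy-number signal changes sign on the flip, so subtracting
    # the two LogRatios retains it while cancelling the dye bias.
    copy_number_log_ratio = (log_ratio_forward - log_ratio_flipped) / 2.0
    # Dye bias has the same sign on both arrays, so averaging retains
    # it while cancelling the copy-number signal.
    dye_bias = (log_ratio_forward + log_ratio_flipped) / 2.0
    return copy_number_log_ratio, dye_bias
```

For instance, LogRatios of 1.1 and -0.9 for the two polarities would correspond, under these assumptions, to a copy-number LogRatio of about 1.0 and a dye bias of about 0.1.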

In silico metrics are calculated parameters that are indicative of probe performance. In silico metrics are those metrics that are calculated in the absence of any experimental data. These metrics are derived from the sequence of the probes themselves, and from the sequences of the genome, or the transcriptome of the organism being studied. In silico metrics for each candidate probe are obtained from the sequences directly, based on known laws of physics and chemistry, such as those related to thermodynamics. In silico metrics used in the methods described herein include, without limitation, duplex melting temperature (T_(m) or DuplexTm) between a probe and its complementary sequence, maximal subsequence duplex melting temperature of a probe (MaxSubSeqTm; the maximal T_(m) for any subsequence of length M within a longer sequence of length N), hairpin thermodynamic properties of the probe (i.e., hairpin melting temperature, Gibbs free energy, number of bases within turns, loops, stems, etc.), and sequence complexity (where complexity refers to the number of bases in the probe that are contained within short simple repeats, such as homopolymers, dimers, trimers, tetramers, etc., for example). For example, with the methods described herein, complexity typically refers to the number of bases contained within repeat units with six nucleotides, i.e. hexamers, but the methods described herein can generally be employed with repeats with any number of nucleotides.

The direct, indirect and in silico metrics are described in detail in U.S. Patent Publication No. 2006/0110744, the disclosure of which is incorporated herein by reference. The analytical process involves calculating a slope, or the responsiveness of a probe to a change in copy number of its complementary target sequence, for each candidate probe, as in 300, based on the response of the probe in an experiment with respect to a particular metric. The slope for each of a set of probes can be measured using a model system where the relative copy numbers of the target molecules for each probe in the set are known in each sample. The measured slope is calculated for each probe within the set, for example, X-chromosome probes, in the case of male and female samples. The slope can be estimated most simply by calculating the ratio of the signals in two samples with two different copy numbers of targets. It can also be the ratio of log-signals to the log-copy numbers, or the ratio of log ratios of signals. In a more complex system, a number of samples can be hybridized, where each pair of samples has a different set of copy numbers for each respective set of probes. For example, in a male-female model system, some sample pairs can be male referenced to female, others can be female referenced to male, and still others can be male referenced to male, or female referenced to female. This provides multiple data points for each probe. The slope for a two-color assay is then calculated by means of a linear regression of the ratios (of signals) for each probe as a function of the ratios of known target copy numbers in each sample. The y-intercept provided by such regression is also useful, as it provides the dye-bias. By analogy, in a single-color assay, the regression is between the measured signals and the known copy number.
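The regression described above can be sketched, under stated assumptions, as an ordinary least-squares fit of measured LogRatios against expected LogRatios. The function name, the use of base-2 logarithms, and the numerical values below are hypothetical and serve only to show how the slope and the y-intercept (dye bias) are obtained for one X-chromosome probe.

```python
def fit_slope(x_values, y_values):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data for one X-chromosome probe. x is the log2 ratio of
# known X copy numbers for each sample pair; y is the measured log2
# ratio of signals on the corresponding array.
expected = [1.0, -1.0, 0.0, 0.0]      # female/male, male/female, male/male, female/female
measured = [0.85, -0.75, 0.06, 0.04]  # illustrative values, not real data
slope, dye_bias = fit_slope(expected, measured)
```

A slope near 1.0 would indicate a probe that responds almost ideally to the change in copy number, while the intercept estimates the LogRatio attributable to the dye labels alone.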

In embodiments, the slope is calculated (as in 300) from the performance of a probe analyzed using a direct metric, as described above: the change in probe response (the signal, ratio of signals, or log of the ratio of signals, with respect to a reference sample) is observed with a change in copy number of the target nucleic acid sequence or region of interest, in a differential model system where DNA copy number is changed in known or predictable ways, such as a dye-swap or dye-flip experiment on a normal pair of female-male samples, cell lines with well-known chromosomal aberrations, or artificially created biological model systems.

In embodiments, using a direct metric, the change in signal is measured with respect to measured quantities such as LogRatio (i.e. the log of the ratio of red to green channels), LogIntensity (the log of the product of red and green channel intensities), and dye bias (the average of LogRatios for a dye-swap pair, obtained by averaging rather than subtracting the LogRatios), for example. For the most robust probe performance, the change in probe response reflects the specific LogRatio change associated with changes in copy number in dye-flip experiments, as measured by subtracting LogRatios.

In embodiments, the slope for a probe is calculated, as in 300, based on the performance of a probe analyzed using an indirect metric, as described above, by observing the change in probe response in relatively simple experiments using a non-differential model. Such indirect metrics include signal strength, dye bias, and differential signals obtained from experiments under various conditions, and they predict the performance of a probe in terms of sensitivity and specificity.

In embodiments, the slope calculation in operation 300 is based on the performance of a probe analyzed using in silico parameters or metrics, as described above, i.e. metrics calculated from the sequences of the probes and of the genome or transcriptome in the absence of any experimental data. These include, without limitation, DuplexTm, MaxSubSeqTm, hairpin thermodynamic properties of the probe, and sequence complexity.

In other embodiments, in silico parameters or metrics associated with the homology of a probe are used. These metrics include, without limitation, homology score (i.e. the distance to the nearest hit, not including the first target sequence, within the target sequence of interest or genome), homology signal-to-background, expressed on a log scale (HomLogS2B, described in U.S. Patent Publication No. 2006/0110744), and predicted homology response (S_(Hom)). The predicted homology response is similar to the HomLogS2B, but instead of predicting the signal-to-background, this score predicts the slope response of a probe based on homology calculations alone, under the assumption that thermodynamic and other properties of the probe are ideal. The predicted homology score is defined by Equation 1:

$$S_{Hom} = \frac{\sum_{j=1}^{TargetSeq} P(mm_{j})}{\sum_{i=1}^{Genome} P(mm_{i})} \qquad (1)$$

where P(mm_(j)) is a penalty term representing the signal contribution (under the specified hybridization conditions) for the hybridization of the probe of interest to each sufficiently complementary mismatch sequence within a specified target sequence or genome. The summation in the denominator in Equation 1 is over all the sequences in the genome, or within the complex set of sequences expected to be in a sample or set of samples. The summation in the numerator in Equation 1 is over the target sequence of interest. In the most specific case, the target sequence refers to the small specific sequence for which the probe is being designed (i.e. within a particular locus within a narrow region of a specific chromosome or region of interest in the genome for which the probe is designed).

In this most specific case, the only contribution to the numerator is the perfect match of the probe to its target, which is taken as 1, and the equation can be simplified as shown in Equation 2:

$$S_{Hom} = \frac{1}{\sum_{i=1}^{Genome} P(mm_{i})} \qquad (2)$$

The function P(mm_(j)) can be calculated using a model for the hybridization between two oligonucleotide sequences using nearest neighbor models. The term is dependent on the number of mismatches, the distribution of mismatches through the aligned sequences, the specific mismatched bases, and the length of the overlap. Although all possible sequences within the target nucleic acid sequence or genome should be considered, in practice, only those sequences that are homologous enough to the probe sequence are considered. For example, with 60-mer probes, all subsequences in the genome that align with fewer than about 20 bases are considered.

This model can be further simplified by approximating the homology slope response using the distances, or number of mismatches, between the probe and the nearest hit (i.e. closest in homology) sequence, as shown in Equation 3:

$$S_{Hom} = \frac{\sum_{d=0}^{D} P_{d} M_{d}}{\sum_{d=0}^{D} P_{d} N_{d}} \qquad (3)$$

where N_(d) represents the total number of hits at a distance d, where d is defined as the number of single-base differences between the probe of interest and the target nucleic acid sequence or region of interest in the genome, and D is the maximum distance that needs to be considered. The denominator represents the signal contributions of all probes in the complex set of sequences, including the target sequence. The numerator represents either the target for the probe sequence, or, if a model system is being used, the region of the model system sequence that is being varied. For example, if the model system is a whole chromosome, then M_(d) represents all the hits within the chromosome at a distance d from the probe of interest. P_(d) is the signal penalty for each mismatch at a distance d. A perfect match has P_(d)=1, and the value of P_(d) decreases towards zero as the number of mismatches increases (i.e. as the system becomes more destabilized). This is an approximation based on the assumption that the average signal reduction across a large number of mismatches is a good representation for any single mismatch. That is, each mismatched base (or insertion or deletion) can be assigned a constant penalty P, giving Equation 4 as the relationship between a single-base penalty and distance:

$$P_{d} \approx P^{d} \qquad (4)$$
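As a minimal sketch of Equations 3 and 4 taken together, the predicted homology response can be computed from counts of hits at each mismatch distance. The function name, the list-based representation of M_(d) and N_(d), and the default per-base penalty of 0.5 are assumptions chosen purely for illustration; the text does not specify a value for P.

```python
def predicted_homology_response(target_hits, genome_hits, penalty=0.5):
    """Equation 3, with P_d approximated as penalty ** d (Equation 4).

    target_hits[d] is M_d and genome_hits[d] is N_d, i.e. the number of
    hits at mismatch distance d. genome_hits is assumed to include the
    hits counted in target_hits. The lists may differ in length; the
    longer one sets the maximum distance D.
    """
    numerator = sum(penalty ** d * m for d, m in enumerate(target_hits))
    denominator = sum(penalty ** d * n for d, n in enumerate(genome_hits))
    return numerator / denominator
```

Under these assumptions, a probe with one perfect match and no other hits has a predicted response of 1.0, while a probe with one perfect match plus two genomic hits at distance 2 has a response of 1 / (1 + 2 × 0.25), or about 0.67.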

In still other embodiments, in silico parameters or metrics that combine homology with thermodynamic properties may be used. For example, maxTemp, defined as the duplex melting temperature (or T_(m)) between the probe and the longest contiguous match within each homologous sequence in the background genome, can be used as an in silico metric for probe performance. In other embodiments, the melting temperature of the closest mismatch to the probe sequence in the genome (MMClosestDuplexTm), as calculated from the nearest neighbor model, can also be used to predict probe performance.

The methods described herein for selecting an oligonucleotide probe for a microarray application include a step for screening candidate probes against probe performance metrics, as indicated in FIG. 1, at operation 102. This operation is further depicted in FIG. 3. The screening process begins with the calculation or determination of a slope for each candidate probe, based on each metric, as indicated in operation 300. Using the X chromosome as a model system to characterize probe performance, empirical measurements of signal changes or LogRatio changes are made. From these empirical results, the slope for each probe is calculated. The slope is defined in either linear or logarithmic space as the ratio of a measured signal or LogRatio to a known or deliberate change in copy number. For probes with measurements at multiple distinct copy numbers, the slope is calculated from the signals or ratios on the y-axis and the known or expected copy number (or fold-change) on the x-axis. For example, where there are only two copy number values, the slope is the difference between the y-axis values divided by the difference between the x-axis values. For probes with ideal response/performance, the slope approaches 1.00. For data points at more than two copy numbers, the slope for each probe is calculated from the best-fit line for a plot of the signal, signal ratios or LogRatios. In embodiments, the slope for data generated from more than two copy numbers is analyzed using statistical methods that eliminate outliers, such as a fitting method that weights data points by variance, for example.

The calculated slope is then plotted against each metric to give a trend curve, as in 302, which can be used to determine the relationship between a given metric and the performance of the probe. The trend curve is then smoothed or fitted with an appropriate theoretical function, such as a set of polynomials with order as high as 20, as in 304, in order to determine the effect variables have on the slope for a given metric. Any set of orthonormal basis functions, as known to those of skill in the art, can be used for the fit. The smoothed or fitted trend curve can then be used to generate a probe score, with each probe being assigned a score, as in 306. The probe scores are assigned based on an arbitrary numerical scale. A probe score at or near the highest end of the scale indicates optimal or best probe performance. For example, the probe scores could lie between 0 and 1 on a numerical scale, and probes are selected if the probe score is closer to 1.0 (i.e. a score closest to 1 implies ideal or best probe performance in a given microarray experiment). Similarly, a numerical scale from 50 to 100 could be used, where probes with scores closest to 100 are selected. In other words, the probe with the best or optimal score is selected depending on the scale employed. Any number of scales, with any variation of numerical ranges, can be employed.
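One way the fitting and scoring steps might be realized is sketched below. This is an assumption-laden illustration rather than the method itself: it fits a low-order polynomial in plain monomials by solving the normal equations (instead of using an orthonormal basis), centers the metric values for numerical stability, and rescales the fitted slope linearly onto a 0 to 1 scale. All function names are hypothetical, and the text does not prescribe this particular mapping from trend curve to score.

```python
def polyfit(xs, ys, order):
    """Least-squares polynomial coefficients c[0] + c[1]*x + ... + c[order]*x**order."""
    n = order + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [[sum(x ** (i + j) for x in xs) for j in range(n)]
            + [sum(y * x ** i for x, y in zip(xs, ys))]
            for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def polyval(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def metric_scores(metric_values, slopes, order=2):
    """Score each probe by the fitted trend value at its metric, rescaled to 0..1."""
    center = sum(metric_values) / len(metric_values)
    centered = [v - center for v in metric_values]  # improves conditioning
    coeffs = polyfit(centered, slopes, order)
    fitted = [polyval(coeffs, v) for v in centered]
    low, high = min(fitted), max(fitted)
    if high == low:
        return [1.0] * len(fitted)  # flat trend: metric does not discriminate
    return [(f - low) / (high - low) for f in fitted]
```

Because the score is read from the fitted curve rather than from each probe's own measured slope, it can also be evaluated for probes that have a metric value but no empirical slope measurement.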

Generating Trend Curves for Measured and Predicted Slope

In embodiments, in order to characterize the relationship between the performance of a probe and the metric used to gauge that performance, the slope points calculated for each probe based on empirical data are plotted against the corresponding values for each metric, as indicated in FIG. 3, at operation 302. For example, using the differential model system of chromosome X and female-male pairs, data is obtained where the target copy numbers are changed by a predictable ratio (i.e. 2:1). When the measured slope for a set of probes is plotted against a given metric, a distribution plot is obtained. For example, FIG. 4 shows a distribution plot of the calculated slope from different arrays against duplex melting temperature (or DuplexT_(m)), an in silico metric. Useful information with regard to the relationship between probe performance and DuplexT_(m) exists if the distribution contains discrete data points (i.e. data points that do not cluster in a round and fuzzy manner). The calculated slope is then plotted against each metric to give a trend curve, as in 302, which can be used to determine the relationship between a given metric and the performance of the probe. The trend curve is then smoothed or fitted with an appropriate theoretical function, such as a set of polynomials with order as high as 20, as in 304, in order to determine the effect variables have on the slope for a given metric. Any set of orthonormal basis functions, as known to those of skill in the art, can be used for the fit. The smoothed or fitted trend curve can then be used to generate a probe score, with each probe being assigned a score, as in 306. The probe scores are assigned based on an arbitrary numerical scale. A probe score at or near the highest end of the scale implies optimal or best probe performance. For example, the probe scores could lie between 0 and 1 on a numerical scale, and probes are selected if the probe score is closer to 1.0 (i.e. a score closest to 1 implies ideal or best probe performance in a given microarray experiment). Similarly, a numerical scale from 50 to 100 could be used, where probes with scores closest to 100 are selected. In other words, the probe with the best or optimal score is selected, but the scale on which the probes are scored is not significant.
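The fit-then-score step of operations 304 and 306 can be sketched as follows. This is only an illustration, not the implementation used to produce the figures: the function name `fit_trend_and_score`, the synthetic data, and in particular the min-max rescaling of the fitted curve onto a 0 to 1 scale are assumptions, since the text leaves the exact mapping from fitted curve to score open. NumPy is assumed to be available.

```python
import numpy as np

def fit_trend_and_score(metric, slope, order=3):
    """Fit slope-vs-metric with a polynomial and rescale the fit to [0, 1].

    The min-max rescaling is an assumed convention; any arbitrary
    numerical scale could be substituted.
    """
    metric = np.asarray(metric, dtype=float)
    slope = np.asarray(slope, dtype=float)
    # Polynomial.fit maps x into a scaled window internally, which keeps
    # higher-order fits numerically better behaved than raw powers of x.
    poly = np.polynomial.Polynomial.fit(metric, slope, deg=order)
    fitted = poly(metric)
    span = fitted.max() - fitted.min()
    if span == 0:
        # A flat trend carries no information; score every probe equally.
        return poly, np.ones_like(fitted)
    return poly, (fitted - fitted.min()) / span

# Hypothetical data: slope response peaking at a duplex Tm of 80 degrees C.
tm = np.linspace(60.0, 100.0, 41)
slope = 1.0 - ((tm - 80.0) / 20.0) ** 2
poly, scores = fit_trend_and_score(tm, slope, order=2)
best_tm = tm[np.argmax(scores)]
```

Here `best_tm` is the metric value whose fitted slope, and therefore score, is highest on the assumed scale.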

The probe with the best or optimal score is assumed to be a “good” probe, i.e. one that is particularly suitable for use in a specific microarray experiment. For example, although not limited to this aspect, the best probe selected according to the present methods may be the one that hybridizes most strongly to the target sequences. The actual underlying relationship (between probes and scores for each metric) can be extracted from these distributions by generating a trend curve. Trend curves can be obtained from the slope data for each metric by a number of methods, including, without limitation, polynomial fits, cubic-spline fits, Fourier transforms, inverse transforms, smooth functional curves (for example, exponentials, arctangents, etc.), Boltzmann distribution curves, etc. Any curve that approximately follows the trend of the data is useful. In embodiments, a straight line fit is appropriate. In other embodiments, the data can also be smoothed and fitted using methods like moving averages, moving medians, LOWESS, LOESS, etc., for example.

An example of a trend curve used in the methods described herein is shown in FIG. 5 (for DuplexT_(m) vs. measured slope). Each point in the trend curve represents the median value for data sorted by rank on the x-axis and then smoothed in 1% bins using a non-linear polynomial fitting method (i.e., the range of data is split into equal-sized bins, each bin containing about 1% of the data). From the trend curve, it is possible to see a relationship between a given metric and the performance of the probe. For example, the trend curve in FIG. 5 indicates that probe performance is best for probes selected on the basis of DuplexT_(m) close to 80° C.
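The rank-sorted, roughly 1% binning described above can be sketched as follows. The function name `binned_median_trend` and the sample data are hypothetical, and only the binning and median steps are shown; the subsequent polynomial smoothing is omitted.

```python
from statistics import median

def binned_median_trend(metric, slope, bin_fraction=0.01):
    """Return (median metric, median slope) per bin of rank-sorted data.

    Each bin holds about bin_fraction of the points (1% by default).
    """
    pairs = sorted(zip(metric, slope))          # sort by metric value (rank)
    n = len(pairs)
    bin_size = max(1, round(n * bin_fraction))  # at least one point per bin
    trend = []
    for start in range(0, n, bin_size):
        chunk = pairs[start:start + bin_size]
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        trend.append((median(xs), median(ys)))
    return trend

# Hypothetical data: 200 probes whose slope rises linearly with the metric.
metric = list(range(200))
slope = [2 * x for x in metric]
trend = binned_median_trend(metric, slope)
```

Using medians rather than means keeps each trend point insensitive to a few outlying probes within its bin.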

In embodiments, it is useful to make the trend curves for a given metric more pronounced, to determine the independent effect that a variable may have on the calculated slope. The response for a given metric can be improved by filtering out a set of values (for a second metric) that are not viable for good probes, or by tuning in on a narrow range where selected probes are expected to be found. For example, if most of the selected probes are expected to occur within a narrow range of DuplexT_(m), then a trend curve can be generated by selecting probes within that narrow range for a particular metric. In embodiments, the trend curves are fitted with polynomials as high as 20th order, as in operation 304 in FIG. 3. An example of such a fitted slope for DuplexT_(m) is shown in FIG. 6.

Statistical Methods for Combining Probe Scores

In embodiments, the trend curves are used to generate individual probe scores that are then combined to give a combined probe score assigned to each candidate probe, as indicated in operation 306 of FIG. 3. The combined or common probe score varies from approximately zero to approximately 1, with a score closer to 1 implying ideal probe performance. In the simplest form, individual probe scores for each metric (S_(m)(p_(i))) are combined into a single combined score (S_(c)(p_(i)); for each probe p_(i)) by adding or averaging the scores for each metric, according to Equation 5: $\begin{matrix}{S_{c} = {\sum\limits_{m}{S_{m}\left( p_{i} \right)}}} & (5)\end{matrix}$ As long as the score for a given metric increases (or decreases) in the same direction, combining the individual probe scores by adding or averaging is sufficient to provide consistent improvements in probe performance with increasing values of the combined score. Once a probe score has been assigned to each probe (or subset of probes), it is straightforward to select probes with the highest score within a window of interest (i.e. a region of interest in the target nucleic acid sequence or genome). In embodiments, probes can also be selected via pairwise filtering/pairwise elimination, a process for reducing the number of probes in a large set, described in detail in U.S. Patent Publication No. 2006/0110744, which is incorporated by reference herein.
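Equation 5 and the selection of the highest-scoring probe within a window can be illustrated with a short sketch. The probe records, field names (`id`, `position`, `scores`), and function names are hypothetical and chosen only for illustration; pairwise filtering is not shown.

```python
def combined_score(metric_scores, average=True):
    """Combine per-metric scores S_m(p_i) by averaging (or summing, Eq. 5)."""
    total = sum(metric_scores.values())
    return total / len(metric_scores) if average else total

def best_probe_in_window(probes, start, end):
    """Return the probe with the highest combined score in [start, end)."""
    in_window = [p for p in probes if start <= p["position"] < end]
    if not in_window:
        return None
    return max(in_window, key=lambda p: combined_score(p["scores"]))

# Hypothetical candidate probes with per-metric scores on a 0 to 1 scale.
probes = [
    {"id": "p1", "position": 100, "scores": {"DuplexTm": 0.5, "homology": 0.75}},
    {"id": "p2", "position": 150, "scores": {"DuplexTm": 1.0, "homology": 0.75}},
    {"id": "p3", "position": 900, "scores": {"DuplexTm": 1.0, "homology": 1.0}},
]
best = best_probe_in_window(probes, 0, 500)
```

Averaging rather than summing keeps the combined score on the same 0 to 1 scale as the individual scores, which is convenient when the number of metrics varies.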

In other embodiments, individual probe scores obtained from the trend curves are combined by fitting multiple scores for a training data set, using the change in the measured slope response with a change in copy number in a model system, as provided in Equation 6: $\begin{matrix}{S_{c} = {\sum\limits_{m}{C_{m}{S_{m}\left( p_{i} \right)}}}} & (6)\end{matrix}$ A number of different methods are available to combine and fit multivariate data in this manner, including, without limitation, principal component analysis (PCA), partial least squares (PLS), chemometrics, as well as other methods.

In still other embodiments, a linear fitting function is used, involving taking the inverse of a matrix (as implemented in Matlab). In this approach, the vector of the measured slope for each probe is represented as Y (one value per probe), and the matrix of the scores as M (number of metrics +1, # of probes), where all but one of the columns of M are vectors of scores for each metric, and the last column is a vector of ones, representing an additive constant. Equation 7 describes the basic relationship between the score S and the matrix of the scores M: $\begin{matrix}{S = {CM}} & (7)\end{matrix}$ where the matrix C is a linear vector, with the coefficient C_(m) representing each element of the matrix (one term for each metric plus the constant term). Multiplying both sides by the inverse of M and solving for C gives the following (Equation 8): $\begin{matrix}{C = {SM^{- 1}}} & (8)\end{matrix}$ where M M⁻¹ is I, the identity matrix, and M⁻¹ is approximated using pinv(M), the Moore-Penrose pseudoinverse of M. The pseudoinverse is implemented as a Matlab function that involves a singular value decomposition (i.e., a common mathematical method to invert a matrix or solve a set of linear equations, which is widely available commercially). This approach allows any number of metrics to be included in the score calculation, as long as the metrics provide information in improving the performance of the selected probes. An example of fitted data obtained from using the above linear matrix functions is shown in FIG. 7, which depicts probe scores obtained by plotting the combined fitted slopes for four different metrics (DuplexTm, complexity, HomLogS2B and MaxSubSeqTm) against the measured slope for the X chromosome model system.
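The pseudoinverse fit of Equations 7 and 8 can be sketched with NumPy's `numpy.linalg.pinv` standing in for the Matlab `pinv` function. The layout is an assumption: M is built with one row per metric plus a final row of ones, so that its shape is (number of metrics + 1, number of probes) and S = CM is dimensionally consistent. The function name and the training values are hypothetical.

```python
import numpy as np

def fit_score_weights(metric_scores, measured_slope):
    """Solve S = C M for C using the Moore-Penrose pseudoinverse.

    metric_scores: array of shape (n_metrics, n_probes)
    measured_slope: array of shape (n_probes,), used as the target S
    Returns the coefficients C (one per metric plus a constant term)
    and the fitted combined scores C @ M.
    """
    scores = np.asarray(metric_scores, dtype=float)
    y = np.asarray(measured_slope, dtype=float)
    # Append a row of ones so the last coefficient acts as an additive constant.
    M = np.vstack([scores, np.ones(scores.shape[1])])
    C = y @ np.linalg.pinv(M)   # Equation 8, with pinv(M) in place of M^-1
    return C, C @ M

# Hypothetical training set: two metrics, five probes, and slopes generated
# from known weights so that the fit can be checked by eye.
s1 = np.array([0.1, 0.4, 0.5, 0.8, 0.9])
s2 = np.array([0.9, 0.2, 0.7, 0.3, 0.6])
slopes = 0.6 * s1 + 0.3 * s2 + 0.1
C, fitted = fit_score_weights(np.vstack([s1, s2]), slopes)
```

Because M is generally not square, the pseudoinverse yields the least-squares solution for C rather than an exact inverse.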

In embodiments, improved probe performance with respect to various metrics may also be obtained using a multiplicative fitting function, rather than an additive function. In a multiplicative curve fit, several individual scores, or combined scores, are multiplied together to produce a combined score. The metrics in each category are first combined using a linear method (such as the additive fitting already described) to produce intermediate scores. These intermediate scores are then combined using a multiplicative approach.

In embodiments, the scores associated with different probes are combined linearly to give an overall score for a particular metric. The overall score for each metric is then combined in a multiplicative fit with the overall score for other metrics. For example, the thermodynamic scores related to the duplex melting temperature (DuplexTm) are combined linearly to give an overall duplex-thermodynamics score D; the homology scores are also combined to give an overall homology score H, and any structural scores for the probe are combined independently giving P, with target structural scores (if any) combined to give an overall target score T. Each of these phenomena can independently lead to decreased probe performance. For example, a nearly ideal probe I with perfect homology scores, but poor thermodynamics scores may only have a slope of 50%. Similarly, a probe with perfect thermodynamic scores but poor homology score also may have only a slope of 50%. It follows then that a probe with both relatively poor thermodynamics (50%) and relatively poor homology scores (50%) will have a slope of 25% rather than 50%, as would be predicted by a linear model. This process will yield an overall probe score that varies between approximately zero and one. The coefficients for the combining of additive terms and the offsets are fitted to the data according to the following equation (Equation 9): $\begin{matrix}{S = {{C_{m}{DPHT}} + C_{a}}} & (9)\end{matrix}$ where C_(m) are the multiplicative coefficients, and C_(a) is an additive coefficient.
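The 50% and 25% example above can be worked through with a small sketch of Equation 9. The default coefficients (C_m = 1, C_a = 0) and the function names are assumptions made only so that the arithmetic is easy to follow; in practice the coefficients would be fitted to training data.

```python
def multiplicative_score(d, p, h, t, c_m=1.0, c_a=0.0):
    """Equation 9: S = C_m * D * P * H * T + C_a."""
    return c_m * d * p * h * t + c_a

def additive_score(d, p, h, t):
    """Plain average of the same category scores, for comparison."""
    return (d + p + h + t) / 4.0

# Poor thermodynamics only: D = 0.5, other categories ideal.
poor_thermo = multiplicative_score(0.5, 1.0, 1.0, 1.0)

# Poor thermodynamics and poor homology: D = 0.5 and H = 0.5.
poor_both = multiplicative_score(0.5, 1.0, 0.5, 1.0)

# The additive average penalizes the second defect much less strongly.
poor_both_additive = additive_score(0.5, 1.0, 0.5, 1.0)
```

With these inputs the multiplicative form gives 0.5 for a single defect and 0.25 for two independent defects, while the plain average of the same four category scores gives 0.75.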

When observing the relationship between probe performance, as measured by the slope response, and T_(m), it is seen that the slope trend continues to improve as the trend tends towards lower T_(m). This means that probes with very low T_(m) would show ideal performance. However, while this is true with respect to the model system, and systems where a large quantity of high quality DNA is plentiful, it will not necessarily be true for many biological and clinical samples where the DNA quantity is very low, or where the DNA is degraded (as in a biopsy sample, for example). Therefore, in embodiments, to increase the robustness of the methods described herein, the score curves are modified, to take into account effects associated with real samples, such as, but not limited to, DNA degradation or low DNA concentration. The methods produce more robust results, if the modified score curves more accurately reflect the performance of real probes in a real biological sample. In particular, the methods are modified in order to produce consistently high signals and robust results, while minimizing the negative impact on probe response.

In embodiments, the methods herein are modified by replacing the fitted DuplexTm trend curves, as in FIG. 8, with various synthetic curves, in order to determine the effects of the modification on probe T_(m) distributions and signal distributions. In FIG. 8, the solid line represents the fitted Tm-slope response curve (i.e. not a synthetic curve), while the dotted line represents a synthetically generated asymmetric Lorentzian curve, with half-width to left of center at 20° C., and half-width to right of center at 10° C. The dash-dotted line is a symmetric triangle function, with half-width at 10° C., while the dashed line is a symmetric exponential decay function with a half-width of 7° C. All the synthetic curves in FIG. 8 are centered at 80° C. In an aspect, the generation of synthetic curves begins with a candidate pool with a large number of X chromosome probes (about 1.4 million). Pairwise filtering, as discussed earlier, is used to select different sets of approximately evenly spaced probes from a candidate set on the basis of the combined score for each set of probes (i.e. probes with the optimal score on an arbitrary numerical scale). This method helps enrich the candidate pool with probes with higher scores (i.e. “good” probes). Briefly, the pairwise filtering method uses the probe's combined score as the target value or parameter, with the pairwise algorithm selecting one of each pair of probes that has the closest score to the target value (i.e. 1). The goal is to select probes within a relatively narrow T_(m) range, based on the idea that probes with near ideal performance will typically fall within a narrow T_(m) range.
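The three synthetic score curves described for FIG. 8 can be sketched as simple functions of T_(m). The exact functional forms are not given in the text, so the following are assumptions: each curve peaks at 1.0 at the 80° C. center, and "half-width" is interpreted as the half-width at half maximum for all three shapes. The function names are hypothetical.

```python
import math

def asymmetric_lorentzian(tm, center=80.0, hw_left=20.0, hw_right=10.0):
    """Lorentzian with a different half-width on each side of the center."""
    hw = hw_left if tm < center else hw_right
    return 1.0 / (1.0 + ((tm - center) / hw) ** 2)

def triangle(tm, center=80.0, hw=10.0):
    """Symmetric triangle falling to half height at hw and to zero at 2*hw."""
    return max(0.0, 1.0 - abs(tm - center) / (2.0 * hw))

def exponential_decay(tm, center=80.0, hw=7.0):
    """Symmetric two-sided exponential that halves every hw degrees."""
    return math.exp(-math.log(2.0) * abs(tm - center) / hw)

# Evaluate each synthetic score at a few melting temperatures.
for tm in (60.0, 73.0, 80.0, 87.0, 90.0):
    print(tm,
          round(asymmetric_lorentzian(tm), 3),
          round(triangle(tm), 3),
          round(exponential_decay(tm), 3))
```

Under these assumed forms, the wider left half-width of the Lorentzian penalizes low-T_(m) probes less sharply than high-T_(m) probes, while the triangle and exponential curves treat both sides equally.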

The score curves synthesized in this manner are used to generate selected measured slope distributions shown in FIG. 9, which shows the probe T_(m) distributions that result from the use of the various combined scores in the pairwise reduction by about a factor of 180:1. The thick solid line is a distribution of T_(m) values of the candidate probes. The thin solid line shows a distribution of selected probe melting temperatures when the fitted T_(m) slope response is used as one component of the combined score in the selection of probes. In each of the following, the synthetic curve replaces only the fitted T_(m)-slope response component in its contribution to the total combined score. The weights Ci of the various scores are kept constant. The dotted line is the T_(m) distribution using the asymmetric Lorentzian function component, the dash-dotted line is the T_(m) distribution using the symmetric triangle function component, and the dashed line is the T_(m) distribution using the symmetric exponential decay function component.

It can be seen from FIG. 9 that the peaks in the distribution shift from low values with the original scoring system to values closer to the optimal 80 degree T_(m), with the modified scoring system (the sharp curve with a peak at 80 degrees is an artifact resulting from the candidate probe selection method, and not a function of the modified scoring system).

In embodiments, the methods described herein use a combination of experimentally measured slopes and predicted slopes to select probes for a microarray application. Such a combination is possible because the predicted slope has the same units and varies over the same range as the measured slope. Consequently, as experimental data becomes available, the predicted slope can be replaced with the measured slope when performing probe selection. This approach can be applied in a number of different ways. For example, the predicted slope can simply be replaced with the measured slope when experimental data is collected. In another embodiment, probes with measured slopes may be preferred over those with comparable predicted slopes by applying a numerical bias to the score, thereby reducing the risk of selecting a probe with good predicted parameters but poor actual performance. In yet another embodiment, uncertainty values are assigned to both the predicted and measured slopes, and the score values and uncertainties are taken into account for probe selection.
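The first two strategies above (replacing a predicted slope with a measured one, and biasing toward measured slopes) can be sketched as follows. The record fields, the additive form of the bias, and the function names are assumptions made for illustration; the uncertainty-weighted variant is not shown.

```python
def effective_slope(probe, measured_bias=0.0):
    """Use the measured slope when available, otherwise the predicted slope.

    measured_bias is an assumed additive bonus that favors probes whose
    performance has actually been measured.
    """
    measured = probe.get("measured_slope")
    if measured is not None:
        return measured + measured_bias
    return probe["predicted_slope"]

def select_probe(probes, measured_bias=0.0):
    """Pick the probe with the highest effective slope."""
    return max(probes, key=lambda p: effective_slope(p, measured_bias))

# Hypothetical candidates: A has only a prediction, B has been measured.
candidates = [
    {"id": "A", "predicted_slope": 0.90, "measured_slope": None},
    {"id": "B", "predicted_slope": 0.70, "measured_slope": 0.88},
]
unbiased_choice = select_probe(candidates)["id"]
biased_choice = select_probe(candidates, measured_bias=0.05)["id"]
```

Because predicted and measured slopes share units and range, they can be compared directly; the bias only decides near-ties in favor of probes with experimental support.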

Arrays

The present description also provides nucleic acid microarrays produced using the subject methods, as described herein. The subject arrays include at least two distinct nucleic acids that differ by monomeric sequence immobilized on, e.g., covalently on, different and known locations on the substrate surface. In certain embodiments, each distinct nucleic acid sequence of the array is typically present as a composition of multiple copies of the polymer on the substrate surface, e.g., as a spot on the surface of the substrate. The number of distinct nucleic acid sequences, and hence spots or similar structures, present on the array may vary, but is generally at least 2, usually at least 5 and more usually at least 10, where the number of different spots on the array may be as high as 100, 1000, 10,000, 100,000, 1,000,000 or higher, depending on the intended use of the array. The spots of distinct polymers present on the array surface are generally present as a pattern, where the pattern may be in the form of organized rows and columns of spots, e.g., a grid of spots, across the substrate surface, a series of curvilinear rows across the substrate surface, e.g., a series of concentric circles or semi-circles of spots, and the like. The density of spots present on the array surface may vary, but will generally be at least about 10 and usually at least about 100 spots/cm², where the density may be as high as 10⁶ or higher. In other embodiments, the polymeric sequences are not arranged in the form of distinct spots, but may be positioned on the surface such that there is substantially no space separating one polymer sequence/feature from another. An exemplary array is described in U.S. Patent Publication No. 20050095596, which is incorporated herein by reference.

Arrays can be fabricated using drop deposition from pulsejets of either polynucleotide precursor units (such as monomers) in the case of in situ fabrication, or the previously obtained polynucleotide. Such methods are described in detail in, for example, the previously cited references including U.S. Pat. No. 6,242,266, U.S. Pat. No. 6,232,072, U.S. Pat. No. 6,180,351, U.S. Pat. No. 6,171,797, U.S. Pat. No. 6,323,043, U.S. patent application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et al., and the references cited therein. These references are incorporated herein by reference. Other drop deposition methods can be used for fabrication, as previously described herein.

A feature of the subject arrays is that they include one or more, usually a plurality of, oligonucleotide probes predicted by the statistical methods described herein. The oligonucleotide probes selected according to the subject methods are suitable for use in a plurality of different gene expression or genomic microarray applications. The statistical regression method evaluates probe performance, without using any assumptions about the functional relationship between the oligonucleotide sequence and the predictive parameters. Oligonucleotide probes that “cluster” (i.e. consistently produce the same response) will perform substantially similarly under a plurality of different experimental conditions.

The arrays as described herein can be used in a variety of different microarray applications, including gene expression experiments and genomic analysis. In using an array, the array will typically be exposed to a sample (for example, a fluorescently labeled analyte, such as a sample containing genomic DNA) and the array then read. Reading of the array may be accomplished by illuminating the array and reading the location and intensity of resulting fluorescence at each feature of the array to detect any binding complexes on the surface of the array. For example, a scanner may be used for this purpose that is similar to the AGILENT MICROARRAY SCANNER available from Agilent Technologies, Palo Alto, Calif. Other suitable apparatus and methods are described in U.S. patent application Ser. No. 09/846,125 “Reading Multi-Featured Arrays” by Dorsel et al.; and Ser. No. 09/430,214 “Interrogating Multi-Featured Arrays” by Dorsel et al. As previously mentioned, these references are incorporated herein by reference.

However, arrays may be read by any other method or apparatus than the foregoing, with other reading methods including other optical techniques (for example, detecting chemiluminescent or electroluminescent labels) or electrical techniques (where each feature is provided with an electrode to detect hybridization at that feature in a manner disclosed in U.S. Pat. No. 6,221,583 and elsewhere). Results from the reading may be raw results (such as fluorescence intensity readings for each feature in one or more color channels) or may be processed results such as obtained by rejecting a reading for a feature which is below a predetermined threshold and/or forming conclusions based on the pattern read from the array (such as whether or not a particular target sequence may have been present in the sample or an organism from which a sample was obtained exhibits a particular condition). The results of the reading (processed or not) may be forwarded (such as by communication) to a remote location if desired, and received there for further use (such as further processing).

Systems

The methods described herein are carried out in part with the aid of a computer-based system, driven by software specific to the methods. A “computer-based system” refers to the hardware, software, and data storage used to analyze the information of the present disclosure. Typical hardware of the computer-based systems of the present disclosure comprises a central processing unit (CPU), input, output, and data storage. A skilled artisan can readily appreciate that any one of the currently available computer-based systems is suitable for use in the present disclosure. The data storage means may comprise any manufacture comprising a recording of the present information as described above, or a memory access means that can access such a manufacture. In certain instances a computer-based system may include one or more wireless devices.

Data from at least one of the detecting and deriving steps, as described above, is transmitted to a remote location. By “remote location” is meant a location other than the location at which the array is present and hybridization occurs. For example, a remote location could be another location (e.g. office, lab, etc.) in the same city, another location in a different city, another location in a different state, another location in a different country, etc. As such, when one item is indicated as being “remote” from another, what is meant is that the two items are at least in different buildings, and may be at least one mile, ten miles, or at least one hundred miles apart. “Communicating” information means transmitting the data representing that information as electrical signals over a suitable communication channel (for example, a private or public network). “Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data. The data may be transmitted to the remote location for further evaluation and/or use. Any convenient telecommunications means may be employed for transmitting the data, e.g., facsimile, modem, internet, etc.

To “record” data, programming or other information on a computer-readable medium refers to a process for storing information on a recordable storage medium, using any such methods as known in the art. Examples include magnetic media such as hard drives, tapes, disks, and the like. Optical media can include CDs, DVDs, and the like. Any convenient data storage structure may be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g., word processing text file, database format, etc.

A “processor” references any hardware and/or software combination that will perform the functions required of it. For example, any processor herein may be a programmable digital microprocessor such as available in the form of an electronic controller, mainframe, server or personal computer (desktop or portable). Where the processor is programmable, suitable programming can be communicated from a remote location to the processor, or previously saved in a computer program product (such as a portable or fixed computer readable storage medium, whether magnetic, optical or solid state device based). For example, a magnetic medium or optical disk may carry the programming, and can be read by a suitable reader communicating with each processor at its corresponding station.

In aspects, the methods described herein are performed using computer-readable media containing programming stored thereon implementing the subject methods. The computer-readable media may be, for example, in the form of a computer disk or CD, a floppy disk, a magnetic “hard card”, a server, or any other computer-readable media capable of containing data or the like, stored electronically, magnetically, optically or by other means. Accordingly, stored programming embodying steps for carrying out the subject methods may be transferred to a computer such as a personal computer (PC), (i.e. accessible by a researcher or the like), by physical transfer of a CD, floppy disk, or like medium, or may be transferred using a computer network, server, or any other interface connection, e.g., the Internet.

In an embodiment, the system described herein may include a single computer or the like with a stored algorithm capable of evaluating probe performance, as described herein, i.e. a computational analysis system that performs statistical regression analysis on a set of training data. In certain embodiments, the system is further characterized in that it provides a user interface, where the user interface presents to a user the option of selecting among one or more different, or multiple different inputs. For example, in the systems described herein, the user has the option of selecting various predictive parameters, such as composition factors, thermodynamic factors, kinetic factors, and mathematical combinations of such factors, as well as analogous parameters for the intended genomic targets. Computational systems that may be readily modified to become systems of the subject invention include those described in U.S. Pat. No. 6,251,588, the disclosure of which is incorporated herein by reference.

The various embodiments described above are provided by way of illustration only and should not be construed to limit the invention. Those skilled in the art will readily recognize various modifications and changes that may be made to the present methods without following the example embodiments and applications illustrated and described herein, and without departing from the true spirit and scope of the claims attached hereto.

1. A method for selecting an oligonucleotide probe for use on a microarray, comprising: generating two or more candidate oligonucleotide probes; analyzing the two or more candidate probes with one or more metrics that indicate probe performance to obtain an individual probe score for each metric; combining the individual probe score for each metric into a single combined score for the probe; and selecting the probe with a combined score closest to an optimal score value for use on a microarray, wherein the optimal score value is the score at, or nearest to, the highest end of a numerical scale of probe scores.
2. The method of claim 1, wherein the optimal score value is about 1.0 on a scale of probe scores ranging from 0.0 to 1.0.
3. The method of claim 1, wherein the optimal score value is about 100 on a scale of probe scores ranging from 50 to 100.
4. The method of claim 1, wherein generating a candidate set of oligonucleotide probes comprises: selecting one or more target sequences within a region of interest; and tiling subsequences of each target sequence across each region of interest to generate the candidate set of potential probes.
5. The method of claim 4, further comprising: generating a large set of potential probes by tiling the target sequences in single base steps across the region of interest; and applying pairwise reduction to reduce the number of probes by a factor of greater than about 2 and less than about 1000.
6. The method of claim 1, wherein the metrics used to analyze the candidate probes comprise direct metrics, indirect metrics, in silico metrics, or combinations thereof.
7. The method of claim 6, wherein direct metrics used to analyze the candidate probes comprise the changes in probe response based on experimentally measured quantities, further comprising known changes in copy number of a target molecule.
8. The method of claim 6, wherein indirect metrics used to analyze the candidate probes comprise changes in predicted probe response resulting from experimentally measured quantities for a target molecule.
9. The method of claim 6, wherein indirect metrics used to analyze the candidate probes comprise changes in predicted probe response measured using empirical relationships based on direct responses from other probe-target molecule duplexes.
10. The method of claim 6, wherein in silico metrics used to analyze the candidate probes comprise changes in probe response based on calculated quantities for a target molecule.
11. The method of claim 6, wherein in silico metrics used to analyze the candidate probes comprise changes in probe response measured using empirical relationships based on direct responses from other probe-target molecule duplexes.
12. The method of claim 1, wherein analyzing the candidate probes with one or more metrics to obtain individual probe scores further comprises: calculating the slope for each candidate probe; plotting the slope against the corresponding value for each of the metrics to obtain a trend curve; and fitting the trend curve with a polynomial function with order n to generate an individual probe score.
13. The method of claim 12, wherein the order n of the polynomial function ranges from n=1 to n=20.
14. The method of claim 1, wherein combining individual probe scores for each metric to obtain a combined score comprises adding or averaging the probe score for each metric.
15. The method of claim 1, wherein combining the individual probe scores to obtain a combined score comprises fitting the scores with a linear additive multivariate fitting function.
16. The method of claim 15, wherein combining the individual probe scores further comprises fitting measured slope responses for a well-characterized training data set to a change in copy number.
17. The method of claim 1, wherein combining the individual probe scores to obtain a combined score comprises fitting the scores with a linear multiplicative curve-fitting function.
18. The method of claim 17, wherein combining the individual probe scores further comprises: combining metrics in each category using a linear model to obtain intermediate scores; and multiplying together the intermediate scores to generate the combined score.
19. The method of claim 1, wherein combining the individual probe scores further comprises synthetically modifying the combined score to obtain probes with more robust performance, the synthetic modification further comprising: generating a candidate set of probes; applying pairwise reduction to reduce the number of probes in the candidate set; calculating the slope for each probe; plotting the slope against the corresponding value for each of the metrics to obtain a trend curve; fitting the trend curve to generate a measured probe score; replacing the fitted trend curve with a synthetic curve; and using the synthetic curve to generate a predicted score for each probe.
20. The method of claim 1, wherein selecting the probe for use in a microarray application comprises: combining experimentally measured scores with predicted scores to obtain a combined score value for the probe; and selecting the probe with a combined score value closest to an optimal score value, wherein the optimal score value is the score at, or nearest to, the highest end of a numerical scale of probe scores.
21. A computer-readable medium having recorded thereon a program that selects a probe for use in microarray applications according to the method of claim 1.
22. A computational analysis system comprising the computer-readable medium according to claim 21.
23. A method of fabricating a nucleic acid microarray, comprising producing at least two different oligonucleotide probes on a microarray substrate, wherein at least one of the two different oligonucleotide probes is a probe selected according to the method of claim 1.
24. A nucleic acid microarray produced according to the method of claim 23.